[Core][Feat] Add max-waiting-queue-time parameter to reject requests#37413
chaunceyjiang wants to merge 2 commits into vllm-project:main
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Code Review
This pull request introduces a max-waiting-queue-time parameter to reject requests when the server is under high load. This is achieved by tracking the average queue time of recent requests using a new QueueTimeTracker class. The feature is well-integrated, with the new parameter added to EngineArgs and propagated down to the serving layers. The logic to check the queue time and reject requests with a 503 error is implemented in OpenAIServing. The QueueTimeTracker itself uses a sliding window with time-based decay, which is a solid approach. My review found one area for improvement in the QueueTimeTracker implementation regarding an unused variable, which I've commented on.
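To make the reviewed design concrete, here is a minimal sketch of a sliding-window tracker with time-based decay and the resulting 503 check. It is not the code from this PR: the constructor parameters (`window_size`, `half_life_s`, `clock`), the exponential half-life weighting, and the `check_queue_time` helper are all assumptions made for illustration. Only the class name `QueueTimeTracker`, the `max-waiting-queue-time` threshold, and the 503 response come from the review above.

```python
import time
from collections import deque
from http import HTTPStatus


class QueueTimeTracker:
    """Hypothetical sketch: decayed average of recent queue times."""

    def __init__(self, window_size=100, half_life_s=10.0, clock=time.monotonic):
        # Bounded deque acts as the sliding window; old samples fall off.
        self._samples = deque(maxlen=window_size)  # (timestamp, queue_time_s)
        self._half_life_s = half_life_s  # assumed decay parameter
        self._clock = clock  # injectable so the tracker can be tested

    def record(self, queue_time_s):
        """Store how long a request waited in the queue, in seconds."""
        self._samples.append((self._clock(), queue_time_s))

    def average(self):
        """Weighted mean where a sample's weight halves every half_life_s."""
        now = self._clock()
        weighted_sum = 0.0
        weight_total = 0.0
        for ts, queue_time_s in self._samples:
            weight = 0.5 ** ((now - ts) / self._half_life_s)
            weighted_sum += weight * queue_time_s
            weight_total += weight
        return weighted_sum / weight_total if weight_total > 0 else 0.0


def check_queue_time(tracker, max_waiting_queue_time):
    """Return 503 when the decayed average exceeds the limit, else 200.

    A limit of None means the feature is disabled (an assumption here).
    """
    if max_waiting_queue_time is None:
        return HTTPStatus.OK
    if tracker.average() > max_waiting_queue_time:
        return HTTPStatus.SERVICE_UNAVAILABLE
    return HTTPStatus.OK
```

In this sketch, recent samples dominate the average, so the server would stop rejecting requests soon after the queue drains instead of waiting for every slow sample to leave the window.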
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@njhill Would you mind taking a look when you get the chance? :)
Purpose
Add a max-waiting-queue-time parameter that rejects incoming requests when the server is under high load.
Test Plan
Test Result